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Abstract 

Unknown constraints arise in many types of ex¬ 
pensive black-box optimization problems. Sev¬ 
eral methods have been proposed recently for 
performing Bayesian optimization with con¬ 
straints, based on the expected improvement (El) 
heuristic. However, El can lead to pathologies 
when used with constraints. For example, in the 
case of decoupled constraints—i.e., when one 
can independently evaluate the objective or the 
constraints—El can encounter a pathology that 
prevents exploration. Additionally, computing 
El requires a current best solution, which may 
not exist if none of the data collected so far sat¬ 
isfy the constraints. By contrast, information- 
based approaches do not suffer from these fail¬ 
ure modes. In this paper, we present a new 
information-based method called Predictive En¬ 
tropy Search with Constraints (PESC). We ana¬ 
lyze the performance of PESC and show that it 
compares favorably to El-based approaches on 
synthetic and benchmark problems, as well as 
several real-world examples. We demonstrate 
that PESC is an effective algorithm that provides 
a promising direction towards a unified solution 
for constrained Bayesian optimization. 
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1. Introduction 

We are interested in finding the global minimum x, of an 
objective function /(x) over some bounded domain, typi¬ 
cally X C R d , subject to the non-negativity of a series of 
constraint functions c\,... ,ck- This can be formalized as 

min /(x) s.t. Ci(x) > 0,..., Cjr(x) > 0 . (1) 

x£ X 

However, / and ci,..., Ck are unknown and can only 
be evaluated pointwise via expensive queries to black¬ 
boxes that provide noise-corrupted evaluations of / and 
ci,..., ck- We assume that / and each of the con¬ 
straints Cfc are defined over the entire space X. We seek 
to find a solution to (1) with as few queries as possible. 
Bayesian optimization (Mockus et al., 1978) methods ap¬ 
proach this type of problem by building a Bayesian model 
of the unknown objective function and/or constraints, us¬ 
ing this model to compute an acquisition function that rep¬ 
resents how useful each input x is thought to be as a next 
evaluation, and then maximizing this acquisition function 
to select a suggestion for function evaluation. 

In this we work we extend Predictive Entropy Search (PES) 
(Hernandez-Lobato et al., 2014) to solve (1), an approach 
that we call Predictive Entropy Search with Constraints 
(PESC). PESC is an acquisition function that approximates 
the expected information gain about the value of the con¬ 
strained minimizer x*. As we will show below, PESC is 
effective in practice and can be applied to a much wider 
variety of constrained problems than existing methods. 
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2. Related Work and Challenges 


2.3. Integrated expected conditional improvement 


Most previous approaches to Bayesian optimization with 
unknown constraints are variants of expected improvement 
(El) (Mockus et al., 1978; Jones et ah, 1998). El measures 
the expected amount by which observing at x leads to im¬ 
provement over the current best value or incumbent ry. 

a E i(x|? 7 ,X>) =J max(0,/(x) —r?)p(/(x)|D)d/(x), (2) 

where V is the collected data. 

2.1. Expected improvement with constraints 

One way to use El with constraints works by discounting El 
by the posterior probability of a constraint violation. The 
resulting acquisition function, which we call expected im¬ 
provement with constraints (EIC), is given by 

K 

<* eic ( x ) = a E i(x\ri,T> f ) JJp(c fc (x) > 0|£> fc ), (3) 

k—1 

where T>f is the set of objective function observations and 
T> k is the set of observations for constraint k. Initially pro¬ 
posed by Schonlau et al. (1998), EIC has recently been 
independently developed in Snoek (2013); Gelbart et al. 
(2014); Gardner et al. (2014). In the constrained case, p 
is the smallest value of the posterior mean of / such that all 
the constraints are satisfied at the corresponding location. 


Gramacy & Lee (2011) propose an acquisition function 
based on the integrated expected conditional improvement 
(IECI), which is given by 

amci(x) = J [a m (x') - a m (x'|x)] h(x')dx.' , (4) 

where a E i(x') is the expected improvement at x' 
and a E i(x|x') is the expected improvement at x' when the 
objective has been evaluated at x, but without knowing the 
value obtained. The IECI at x is the expected reduction in 
improvement at x' under the density /i(x') caused by ob¬ 
serving the objective at that location, where h(x!) is the 
probability of all the constraints being satisfied at x'. Gel¬ 
bart et al. (2014) compare IECI with EIC for optimizing 
the hyper-parameters of a topic model with constraints on 
the entropy of the per-topic word distribution and show that 
EIC outperforms IECI for this problem. 

2.4. Expected volume reduction 

Picheny (2014) proposes to sequentially explore the loca¬ 
tion that yields that largest the expected volume reduction 
(EVR) of the feasible region below the best feasible objec¬ 
tive value r] found so far. This quantity is given by integrat¬ 
ing the product of the probability of improvement and the 
probability of feasibility. That is, 

«evr(x) = - J p[/(x') < min(ry, /(x))]/i(x , )dx / , (5) 


2.2. Augmented Lagrangian 

Gramacy et al. (2014) propose a combination of the 
expected improvement heuristic and the augmented La¬ 
grangian (AL) optimization framework for constrained 
blackbox optimization. AL methods are a class of algo¬ 
rithms for constrained nonlinear optimization that work by 
iteratively optimizing the unconstrained AL: 


L a (x\X,p) 


K 


f( x ) + 

k= 1 


1 

2 P 


min(0, Cfc(x )) 2 


A feC fe (x) 


where p > 0 is a penalty parameter and A > 0 is an ap¬ 
proximate Lagrange multiplier, both of which are updated 
at each iteration. 

The method proposed by Gramacy et al. (2014) uses 
Bayesian optimization with El to solve the unconstrained 
inner loop of the augmented Lagrangian formulation. AL 
is limited by requiring noiseless constraints so that p and 
A can be updated at each iteration. In section 4.3 we show 
that PESC and EIC perform better than AL on the synthetic 
benchmark problem considered in Gramacy et al. (2014), 
even when the AL method has access to the true objective 
function and PESC and EIC do not. 


where, as in IECI, Ji(x') is the probability that the con¬ 
straints are satisfied at x'. This step-wise uncertainty re¬ 
duction approach is similar to PESC in that both methods 
work by reducing a specific type of uncertainty measure 
(entropy for PESC and expected volume for EVR). 

2.5. Challenges 

El-based methods for constrained optimization have sev¬ 
eral issues. First, when no point in the search space is 
feasible under the above definition, p does not exist and 
the El cannot be computed. This issue affects EIC, IECI, 
and EVR. To address this issue, Gelbart et al. (2014) mod¬ 
ify EIC to ignore the factor EI(x|? 7 , D ‘) in (3) and only 
consider the posterior probability of the constraints being 
satisfied when p is not defined. The resulting acquisition 
function focuses only on searching for a feasible location 
and ignores learning about the objective /. 

Furthermore, Gelbart et al. (2014) identify a pathology with 
EIC when one is able to separately evaluate the objective 
or the constraints, i.e., the decoupled case. The best solu¬ 
tion x* must satisfy a conjunction of low objective value 
and high (non-negative) constraint values. By only evalu¬ 
ating the objective or a single constraint, this conjunction 
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cannot be satisfied by a single observation under a myopic 
search policy. Thus, the new observed x cannot become the 
new incumbent as a result of a decoupled observation and 
the expected improvement is zero. Therefore standard EIC 
fails in the decoupled setting. Gelbart et al. (2014) circum¬ 
vent this pathology by treating decoupling as a special case 
and using a two-stage acquisition function: first, x is cho¬ 
sen with EIC, and then, given x, the task (whether to eval¬ 
uate the objective or one of the constraints) is chosen with 
the method in Villemonteix et al. (2009). This approach 
does not take full advantage of the available information 
in the way a joint selection of x and the task would. Like 
EIC, the methods AL, IECI, and EVR are also not easily 
extended to the decoupled setting. 

In addition to this difficulties, EVR and IECI are limited 
by having to compute the integrals in (4) and (5) over the 
entire domain, which is done numerically over a grid on x' 
(Gramacy & Lee, 2011; Picheny, 2014). The resulting ac¬ 
quisition function must then be globally optimized, which 
also requires a grid on x. This nesting of grid operations 
limits the application of this method to small d. 

Our new method, PESC, does not suffer from these 
pathologies. First, the PESC acquisition function does not 
depend on the current best feasible solution, so it can op¬ 
erate coherently even when there is not yet a feasible solu¬ 
tion. Second, PESC naturally separates the contribution of 
each task (objective or constraint) in its acquisition func¬ 
tion. As a result, no pathology arises in the decoupled case 
and, thus, no ad hoc modifications to the acquisition func¬ 
tion are required. Third, likewise EVR and IECI, PESC 
also involves computing a difficult integral (over the poste¬ 
rior on x*). However, this can be done efficiently using the 
sampling approach described in Hernandez-Lobato et al. 
(2014). Furthermore, in addition to its increased general¬ 
ity, our experiments show that PESC performs favorably 
when compared to EIC and AL even in the basic setting of 
joint evaluations to which these methods are most suited. 

3. Predictive entropy search with constraints 

We seek to maximize information about the location x*, the 
constrained global minimum, whose posterior distribution 
is p(x*|2?°,..., V K ). We assume that / and ci,..., Ck 
follow independent Gaussian process (GP) priors (see, e.g., 
Rasmussen & Williams, 2006) and that observation noise 
is i.i.d. Gaussian with zero mean. GPs are widely-used 
probabilistic models for Bayesian nonparametric regres¬ 
sion which provide a flexible framework for working with 
unknown response surfaces. 

In the coupled setting we will let V = {(x n , y n )} n <jv de¬ 
note all the observations up to step TV, where y„ is a vec¬ 
tor collecting the objective and constraint observations at 


step n. The next query x.v+i can then be defined as that 
which maximizes the expected reduction in the differential 
entropy H[•] of the posterior on x*. We can write the PESC 
acquisition function as 

a(x) = H [x*|£>] - E y {H [x*|X> U (x, y)]} (6) 

where the expectation is taken with respect to the posterior 
distribution on the noisy evaluations of / and C\. ... ,c K 
at x, that is, p(y\T>, x). 

The exact computation of the above expression is infeasi¬ 
ble in practice. Instead, we follow Houlsby et al. (2012); 
Hernandez-Lobato et al. (2014) and take advantage of the 
symmetry of mutual information, rewriting this acquisition 
function as the mutual information between y and x* given 
the collected data V. That is, 

a( x ) = H [y| D, x] - E Xi {H [y| D, x, x*]} (7) 

where the expectation is now with respect to the posterior 
p(x*|X>) and where p( y|2?,x,x*) is the posterior predic¬ 
tive distribution for objective and constraint values given 
past data and the location of the global solution to the con¬ 
strained optimization problem x*. We call j)(y\'D. x. x*) 
the conditioned predictive distribution (CPD). 

The first term on the right-hand side of (7) is straightfor¬ 
ward to compute: it is the entropy of a product of indepen¬ 
dent Gaussians, which is given by 

k AT + 1 

H(y \V, x) = log Vf + Y^ lo S v k H- 2 — lo g( 27re ) > ( 8 ) 

fc =i 

where Vf and Vk are the predictive variances of the ob¬ 
jective and constraints, respectively. However, the second 
term in the right-hand side of (7) has to be approximated. 
For this, we first approximate the expectation by averaging 
over samples of x, approximately drawn from p(x*|D). To 
sample x*, we first approximately draw / and c \..... c/< 
from their GP posteriors using a finite parameterization of 
these functions. Then we solve a constrained optimiza¬ 
tion problem using the sampled functions to yield a sample 
of x*. This optimization approach is an extension of the 
approach described in more detail by Hernandez-Lobato 
et al. (2014), extended to the constrained setting. For each 
value of x* generated by this procedure, we approximate 
the CPD p(y\T>, x, x,) as described in the next section. 

3.1. Approximating the CPD 

Let z = [/(x), ci(x),..., Ck (x)] ^ denote the concate¬ 
nated vector of the noise-free objective and constraint val¬ 
ues at x. We can approximate the CPD by first approx¬ 
imating the posterior predictive distribution of z condi¬ 
tioned on V, x, and x*, which we call the noise free CPD 
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(NFCPD), and then convolving that approximation with ad- rather than the full /, c 1 ,..., c k. We first approximate the 

ditive Gaussian noise of variance erg,..., cr 2 K . factors in (11) that do not depend on x as 


We first consider the distribution p(x* | /, ci,..., Ck). The 
variable x* is in fact a deterministic function of the latent 
functions /, ci,..., Ck- in particular, x* is the global mini- 
mizer if and only if (i) all constraints are satisfied at x* and 
(ii) /(x*) is the smallest feasible value in the domain. We 
can informally translate these deterministic conditions into 
a conditional probability: 


p(x* I /,Cl, ...,C K ) 


n e M x *)] 


_k=l 


n 

x' ex 


(9) 


where 'F(x') is defined as 


n e [ c fe( x ')]) o[/( x ') - /( x *)]+( 1 - n 9 m*)] 


and the symbol 0 denotes the Heaviside step function with 
the convention that 0(0) = 1. The first product in (9) en¬ 
codes condition (i) and the infinite product over T (x'j en¬ 
codes condition (ii). Note that 'f'(x') also depends on x* 
and /, Ci,..., ck ; we use the notation 'T (x') for brevity. 


Because z is simply a vector containing the values of 
/, Ci,..., Ck at x, z is also a deterministic function of 
/, ci,..., Ck and we can write p (z | /, Ci,..., ck , x) us¬ 
ing Dirac delta functions to pick out the values at x: 


K 

p(z | /, ci,..., c K , x) m 5{z 0 - /(x)] ]^[ 5[z k - c fc (x)]. (10) 

k =1 


We can now write the NFCPD by i) noting that z is inde¬ 
pendent of x* given /, c i,..., ck. ii) multiplying the prod¬ 
uct of (9) and (10) by p(/, Ci,..., ck\T>) and iii) integrat¬ 
ing out the latent functions /, ci,..., Ck- 


p( z \D, x, x*) oc J 5 [zq — /(x)] [nf=i S[z k -c k {x)]\ 
[nf =1 0 M x *)]] [n xVx T(x')]^(x) 

p(f,c 1 ,... ,c K \T>)df del... dc k , (11) 


9 i(f, ci, ■ ■ ■ ,c K ) = 

[nti ©[cfeoi] [nli ®( x ~)] p( f, ci,..., c k \d) (i 2 ) 

where p( f, Ci,..., ck\T>) is the GP predictive distribution 
for objective and constraint values. Because (12) is not 
tractable, we approximate the normalized version of q\ 
with a product of Gaussians using expectation propagation 
(EP) (Minka, 2001). In particular, we obtain 

Z^ 1 q 1 (f,c 1 ,...,c K ) ~ q 2 {f,c ly ...,c K ) = 

7V(f|m 0 , V 0 ) nli A/"(c fc |m fc , V fc ), (13) 

where Z 1 is the normalization constant of q\ and (mt, V/,.) 
for k = 0,..., K are the mean and covariance terms deter¬ 
mined by EP. See the supplementary material for details 
on the EP approximation. Roughly speaking, EP approxi¬ 
mates each tme (but intractable) factor in (12) with a Gaus¬ 
sian factor whose parameters are iteratively refined. The 
product of all these Gaussian factors produces a tractable 
Gaussian approximation to (12). 

We now approximate the portion of (11) that does depend 
on x, namely the first line and the factor 'T (x), by replacing 
the deltas with p(z|f, ci,..., c k), the K + 1 dimensional, 
Gaussian conditional distribution given by the GP priors 
on f,c\,... ,Ck- Our full approximation to (11) is then 

p(zjD,x,x*) « Zjf p(z|f,c 1 ,...,c Jf )T'(x) 

q 2 ( f, c 1; ..., c K ) df dc\ ■ ■ ■ dc K , (14) 

where Z 2 is a normalization constant. From here, we an¬ 
alytically marginalize out all integration variables except 
/o = /(x+); see the supplementary material for the full 
details. This calculation, and those that follow, must be re¬ 
peated for every x; however, the EP approximation in (13) 
can be reused over all x. After performing the integration, 
we arrive at 


where p(f , ci,..., ck\D) is an infinite-dimensional Gaus¬ 
sian given by the GP posterior on f,Ci,... ,Ck, and we 
have separated ’T(x) out from the infinite product over x'. 

We find a Gaussian approximation to (11) in several steps. 
The general approach is to separately approximate the fac¬ 
tors that do and do not depend on x, so that the computa¬ 
tions associated with the latter factors can be reused rather 
than recomputed for each x. In (11), the only factors that 
depend on x are the deltas in the first line, and T (x). 

Let f denote the (A' + 1 )-dimcnsional vector containing ob¬ 
jective function evaluations at x* and x 1; ..., x v , and de¬ 
fine constraint vectors ci,..., c k similarly. Then, we ap¬ 
proximate (11) by conditioning only on f and ci,..., c k. 


p( z|r>,x,x„)« 

K 

^( x )Af([ 2 o,/o]|mo, Vq) Y\Af(z k \m' k ,v' k )df 0 , (15) 

k =1 

where zq = /(x). Details on how to compute the means 
m[,..., m' K and variances v[,..., v' K , as well as the 2- 
dimensional mean vector nig and the 2x2 covariance ma¬ 
trix Vg can be found in the supplementary material. 

We perform one final approximation to (15). We approxi¬ 
mate this distribution with a product of independent Gaus¬ 
sians that have the same marginal means and variances as 
(15). This corresponds to a single iteration of EP; see the 
supplementary material for details. 
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3.2. The PESC acquisition function 


By approximating the NFCPD with a product of indepen¬ 
dent Gaussians, we can approximate the entropy in the 
CPD by performing the following operations. First, we add 
the noise variances to the marginal variances of our final 
approximation of the NFCPD and second, we compute the 
entropy with (8). The PESC acquisition function, which 
approximates (6), is then 


opesc(x) = |log« PD (x) + Y log?4 D (x) 
l k= 1 

m J2 \ lo g w PD ( x i x * m) ) + J2 lo s ^ PD ( x i x * m) ) f > 

m =1 i, k= 1 J 

( 16 ) 

where M is the number of samples drawn from p(x* \T>), 
x* n ' ) is the m-th of these samples, r; PD (x) and v PD (x) are 
the predictive variances for the noisy evaluations of / and 
Cfc at x, respectively, and •u^ PD (x|x* m ' 1 ) and r^ PD (x|xi m ' ) ) 
are the approximated marginal variances of the CPD for the 
noisy evaluations of / and Ck at x given that x* = x*"\ 
Marginalization of (16) over the GP hyper-parameters can 
be done efficiently as in Hernandez-Lobato et al. (2014). 



The PESC acquisition function is additive in the expected 
amount of information that is obtained from the evalua¬ 
tion of each task (objective or constraint) at any particu¬ 
lar location x. For example, the expected information gain 
obtained from the evaluation of / at x is given by the 


in (16). 


term iE!= 1 log^ D (x) -log^ PD (x|xi m) ) 

The other I\ terms in (16) measure the corresponding con¬ 
tribution from evaluating each of the constraints. This al¬ 
lows PESC to easily address the decoupled scenario when 
one can independently evaluate the different functions at 
different locations. In other words. Equation (16) is a sum 
of individual acquisition functions, one for each function 
that we can evaluate. Existing methods for Bayesian op¬ 
timization with unknown constraints (described in Section 
2) do not possess this desirable property. Finally, the com¬ 
plexity of PESC is of order 0(MKN 3 ) per iteration in the 
coupled setting. As with unconstrained PES, this is domi¬ 
nated by the cost of a matrix inversion in the EP step. 


4. Experiments 

We evaluate the performance of PESC through experi¬ 
ments with i) synthetic functions sampled from the GP 
prior distribution, ii) analytic benchmark problems previ¬ 
ously used in the literature on Bayesian optimization with 
unknown constraints and iii) real-world constrained opti¬ 
mization problems. 

For case i) above, the synthetic functions sampled from the 
GP prior are generated following the same experimental set 


up as in Hennig & Schuler (2012) and Hernandez-Lobato 
et al. (2014). The search space is the unit hypercube of 
dimension d, and the ground truth objective / is a sample 
from a zero-mean GP with a squared exponential covari¬ 
ance function of unit amplitude and length scale t = 0.1 in 
each dimension. We represent the function / by first sam¬ 
pling from the GP prior on a grid of 1000 points generated 
using a Halton sequence (see Leobacher & Pillichsham- 
mer, 2014) and then defining / as the resulting GP pos¬ 
terior mean. We use a single constraint function c\ whose 
ground truth is sampled in the same way as /. The evalu¬ 
ations for / and C\ are contaminated with i.i.d. Gaussian 
noise with variance aj = aj = 0.01. 

4.1. Accuracy of the PESC approximation 

We first analyze the accuracy of the approximation to (7) 
generated by PESC. We compare the PESC approximation 
with a ground truth for (7) obtained by rejection sampling 
(RS). The RS method works by discretizing the search 
space using a uniform grid. The expectation with respect 
to p(x*|2?„) in (7) is then approximated by Monte Carlo. 
To achieve this, / and ci,... ,Ck are sampled on the grid 
and the grid cell with positive ci,..., Cr- (feasibility) and 
the lowest value of / (optimality) is selected. For each sam¬ 
ple of x* generated by this procedure, H [p(y\T> n , x, x*)] is 
approximated by rejection sampling: we select those sam¬ 
ples of / and Ci,... ,ck whose corresponding feasible op¬ 
timal solution is the sampled x, and reject the other sam¬ 
ples. We then assume that the selected samples for / and 
c±,, Ck are independent and have Gaussian marginal 
distributions. Under this assumption, H [p(y\V n , x, x*)] 
can be approximated using the formula for the entropy 
of independent Gaussian random variables, with the vari¬ 
ance parameters in this formula being equal to the empir¬ 
ical marginal variances of the selected samples of / and 
ci,..., ck at x plus the corresponding noise variances aj 
and aj,..., a 2 K . 

The left plot in Figure 1 shows the posterior distribution 
for / and ci given 5 evaluations sampled from the GP prior 
with d = 1. The posterior is computed using the optimal 
GP hyperparameters. The corresponding approximations 
to (7) generated by PESC and RS are shown in the middle 
plot of Figure 1 . Both PESC and RS use a total of 50 sam¬ 
ples from p(x*|D„) when approximating the expectation 
in (7). The PESC approximation is very accurate, and im¬ 
portantly its maximum value is very close to the maximum 
value of the RS approximation. 

One disadvantage of the RS method is its high cost, which 
scales with the size of the grid used. This grid has to be 
large to guarantee good performance, especially when d is 
large. An alternative is to use a small dynamic grid that 
changes as data is collected. Such a grid can be obtained 
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Figure 1. Assessing the accuracy of the PESC approximation, (a) Marginal posterior predictive distributions for the objective and 
constraint given some collected data denoted by x’s. (b) PESC and RS acquisition functions given the data in (a), (c) Median utility gap 
for PESC, RS and RSDG in the experiments with synthetic functions sampled from the GP prior with d = 1. 


by sampling from p(x*|2? n ) using the same approach as in 
PESC. The samples obtained would then form the dynamic 
grid. The resulting method is called Rejection Sampling 
with a Dynamic Grid (RSDG). 

We compare the performance of PESC, RS and RSDG in 
experiments with synthetic data corresponding to 500 pairs 
of / and ci sampled from the GP prior with d = 1. At 
each iteration, RSDG draws the same number of samples 
of x* as PESC. We assume that the GP hyperparameter 
values are known to each method. Recommendations are 
made by finding the location with lowest posterior mean 
for / such that Ci is non-negative with probability at least 
1 — <5i, where = 0.05. For reporting purposes, we set 
the utility u(x) of a recommendation x to be /(x) if x sat¬ 
isfies the constraint, and otherwise a penalty value of the 
worst (largest) objective function value achievable in the 
search space. For each recommendation at x, we compute 
the utility gap |u(x) — u(x*)|, where x* is the true solu¬ 
tion of the optimization problem. Each method is initial¬ 
ized with the same three random points drawn with Latin 
hypercube sampling. 

The right plot in Figure 1 shows the median of the utility 
gap for each method across the 500 realizations of / and Ci. 
The ;/:-axis in this plot is the number of joint function eval¬ 
uations for / and ci. We report the median because the 
empirical distribution of the utility gap is heavy-tailed and 
in this case the median is more representative of the lo¬ 
cation of the bulk of the data than the mean. The heavy 
tails arise because we are measuring performance across 
500 different optimization problems with very different de¬ 
grees of difficulty. In this and all following experiments, 
standard errors on the reported plot are computed using 
the bootstrap. The plot shows that PESC and RS are bet¬ 
ter than RSDG. Furthermore, PESC is very similar to RS, 
with PESC even performing slightly better at the end of the 
data collection process since PESC is not limited by a finite 


grid as RS is. These results show that PESC yields a very 
accurate approximation of the information gain. Further¬ 
more, although RSDG performs worse than PESC, RSDG 
is faster because the rejection sampling operation (with a 
small grid) is less expensive than the EP algorithm. Thus, 
RSDG is an attractive alternative to PESC when the avail¬ 
able computing time is very limited. 

4.2. Synthetic functions in 2 and 8 input dimensions 

We also compare the performance of PESC and RSDG 
with that of EIC (Section 2.1) using the same experimental 
protocol as in the previous section, but with dimensionali¬ 
ties d = 2 and d = 8. We do not compare with RS here be¬ 
cause its use of grids does not scale to higher dimensions. 
Figure 4.1 shows the utility gap for each method across 500 
different samples of / and c\ from the GP prior with d = 2 
(a) and d = 8 (b). Overall, PESC is the best method, fol¬ 
lowed by RSDG and EIC. RSDG performs similarly to 
PESC when d = 2, but is significantly worse when d = 8. 
This shows that, when d is high, grid based approaches (e.g. 
RSDG) are at a disadvantage with respect to methods that 
do not require a grid (e.g. PESC). 

4.3. A toy problem 

We compare PESC with EIC and AL (Section 2.2) in the 
toy problem described in Gramacy et al. (2014). We seek 
to minimize the function f(x) = X\ + #2, subject to the 
constraint functions C\ (x) > 0 and c 2 (x) > 0, given by 

Ci(x) = 0.5 sin (27r(a^ — 2 x 2 )) + x\ + 2xi — 1.5 , (17) 
c 2 (x) = — xf — x\ -(- 1.5 , (18) 

where x is confined to the unit square. The evaluations for 
/, Ci and C 2 are noise-free. We compare PESC and EIC 
with <5-| = 62 = 0.025 and a squared exponential GP ker¬ 
nel. PESC uses 10 samples from p(x*|2? n ) when approx- 
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(a) Optimizing GP samples in d = 2 (b) Optimizing GP samples in d = 8 


(c) 2d toy problem 


Figure 2. Assessing PESC on synthetic problems. (a,b) Compare PESC to EIC and RSDG on optimizing samples from the GP in 
dimension 2 and 8 respectively, and (c) compares PESC to AL and EIC. 


imating the expectation in (7). We use the AL implemen¬ 
tation provided by Gramacy et al. (2014) in the R package 
laGP which is based on the squared exponential kernel and 
assumes the objective / is known. Thus, in order for this 
implementation to be used, AL has an advantage over other 
methods in that it has access to the true objective function. 
In all three methods, the GP hyperparameters are estimated 
by maximum likelihood. 

Figure 2(c) shows the mean utility gap for each method 
across 500 independent realizations. Each realization cor¬ 
responds to a different initialization of the methods with 
three data points selected with Latin hypercube sampling. 
Here, we report the mean because we are now measuring 
performance across realizations of the same optimization 
problem and the heavy-tailed effect described in Section 
4.1 is less severe. The results show that PESC is signif¬ 
icantly better than EIC and AL for this problem. EIC is 
superior to AL, which performs slightly better at the begin¬ 
ning, presumably because it has access to the ground truth 
objective /. 

4.4. Finding a fast neural network 

In this experiment, we tune the hyperparamters of a three- 
hidden-layer neural network subject to the constraint that 
the prediction time must not exceed 2 ms on a GeForce 
GTX 580 GPU (also used for training). The search space 
consists of 12 parameters: 2 learning rate parameters (ini¬ 
tial and decay rate), 2 momentum parameters (initial and 
final), 2 dropout parameters (input layer and other lay¬ 
ers), 2 other regularization parameters (weight decay and 
max weight norm), the number of hidden units in each of 
the 3 hidden layers, the activation function (RELU or sig¬ 
moid). The network is trained using the deepnet package 1 , 
and the prediction time is computed as the average time of 
1000 predictions, each for a batch of size 128. The net¬ 
work is trained on the MNIST digit classification task with 

1 https://github.com/nitishsrivastava/deepnet 


momentum-based stochastic gradient descent for 5000 iter¬ 
ations. The objective is reported as the classification error 
rate on the validation set. As above, we treat constraint 
violations as the worst possible value (in this case a classi¬ 
fication error of 1.0). 

Figure 3(a) shows the results of 50 iterations of Bayesian 
optimization. In this experiment and the next, the y-axis 
represents observed objective values, Ai = 0.05, a Matern 
5/2 GP covariance kernel is used, and GP hyperparame¬ 
ters are integrated out using slice sampling (Neal, 2000) as 
in Snoek et al. (2012). Curves are the mean over 5 inde¬ 
pendent experiments. We find that PESC performs signif¬ 
icantly better than EIC. However, when the noise level is 
high, reporting the best objective observation is an overly 
optimistic metric (due to “lucky” evaluations); on the other 
hand, ground-truth is not available. Therefore, to validate 
our results further, we used the recommendations made at 
the final iteration of Bayesian optimization for each method 
(EIC and PESC) and evaluted the function with these rec¬ 
ommended parameters. We repeated the evaluation 10 
times for each of the 5 repeated experiments to compute 
a ground-truth score averaged of 50 function evaluations. 
This procedure yields a score of 7.0 ± 0.6% for PESC and 
49 ± 4% for EIC (as in the figure, constraint violations are 
treated as a classification error of 100%). This result is con¬ 
sistent with Figure 3(a) in that PESC performs significantly 
better than EIC, but also demonstrates that, due to noise. 
Figure 3(a) is overly optimistic. While we may believe this 
optimism to affect both methods equally, the ground-truth 
measurement provides a more reliable result and a much 
clearer understanding of the classification error attained by 
Bayesian optimization. 

4.5. Tuning Markov chain Monte Carlo 

Hybrid Monte Carlo, also known as Hamiltonian Monte 
Carlo (HMC), is a popular Markov Chain Monte Carlo 
(MCMC) technique that uses gradient information in a nu¬ 
merical integration to select the next sample. However, 
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(a) Tuning a fast neural network 


(b) Tuning Hamiltonian MCMC 


Figure 3. Comparing PESC and EIC for (a) minimizing classification error of a 3-hidden-layer neural network constrained to make 
predictions in under 2 ms, and (b) tuning Hamiltonian Monte Carlo to maximize the number of effective samples within 5 minutes of 
compute time. 


using numerical integration gives rise to new parameters 
like the integration step size and the number of integration 
steps. Following the experimental set up in Gelbart et al. 
(2014), we optimize the number of effective samples pro¬ 
duced by an HMC sampler limited to 5 minutes of com¬ 
putation time, subject to passing of the Geweke (Geweke, 
1992) and Gelman-Rubin (Gelman & Rubin, 1992) conver¬ 
gence diagnostics, as well as the constraint that the numer¬ 
ical integration should not diverge. We tune 4 parameters 
of an HMC sampler: the integration step size, number of 
integration steps, fraction of the allotted 5 minutes spent in 
burn-in, and an HMC mass parameter (see Neal, 2011). We 
use the coda R package (Plummer et al., 2006) to compute 
the effective sample size and the Geweke convergence di¬ 
agnostic, and the PyMC python package (Patil et al., 2010) 
to compute the Gelman-Rubin diagnostic over two inde¬ 
pendent traces. Following Gelbart et al. (2014), we impose 
the constraints that the absolute value of the Geweke test 
score be at most 2.0 and the Gelman-Rubin score be at most 
1.2, and sample from the posterior distribution of a logistic 
regression problem using the UCI German credit data set 
(Frank & Asuncion, 2010). 

Figure 3(b) evaluates EIC and PESC on this task, averaged 
over 10 independent experiments. As above, we perform a 
ground-truth assessment of the final recommendations. The 
average effective sample size is 3300 ± 1200 for PESC and 
2300 ± 900 for EIC. From these results we draw a similar 
conclusion to that of Figure 3(b); namely, that PESC out¬ 
performs EIC but only by a small margin, and furthermore 
that the experiment is very noisy. 


5. Discussion 

In this paper, we addressed global optimization with un¬ 
known constraints. Motivated by the weaknesses of exist¬ 
ing methods, we presented PESC, a method based on the 
theoretically appealing expected information gain heuris¬ 
tic. We showed that the approximations in PESC are quite 
accurate, and that PESC performs about equally well to a 
ground truth method based on rejection sampling. In sec¬ 
tions 4.2 to 4.5, we showed that PESC outperforms current 
methods such as EIC and AL over a variety of problems. 
Furthermore, PESC is easily applied to problems with de¬ 
coupled constraints, without additional computational cost 
or the pathologies discussed in Gelbart et al. (2014). 

One disadvantage of PESC is that it is relatively difficult to 
implement: in particular, the EP approximation often leads 
to numerical instabilities. Therefore, we have integrated 
our implementation, which carefully addresses these nu¬ 
merical issues, into the open-source Bayesian optimization 
package Spearmint at https : //github. com/HIPS/ 
Spearmint/tree/PESC. We have demonstrated that 
PESC is a flexible and powerful method and we hope the 
existence of such a method will bring constrained Bayesian 
optimization into the standard toolbox of Bayesian opti¬ 
mization practitioners. 
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1 Description of Expectation Propagation 

PESC computes a Gaussian approximation to the NFCPD (main text, Eq. (11)) using Expectation Propagation 
(EP) (Minka, 2001). EP is a method for approximating a product of factors (often a single prior factor and multiple 
likelihood factors) with a tractable distribution, for example a Gaussian. EP generates a Gaussian approximation 
by approximating each individual factor with a Gaussian. The product all these Gaussians results in a single 
Gaussian distribution that approximates the product of all the exact factors. This is in contrast to the Laplace 
approximation which fits a single Gaussian distribution to the whole posterior. EP can be intuitively understood 
as fitting the individual Gaussian approximations by minimizing the Kullback-Leibler (KL) divergences between 
each exact factor and its corresponding Gaussian approximation. This would correspond to matching first and 
second moments between exact and approximate factors. However, EP does this moment matching in the context 
of all the other approximate factors, since we are ultimately interested in having a good approximation in regions 
where the overall posterior probability is high. Concretely, assume we wish to approximate the distribution 

N 

q(x) = II^( x ) 

n—1 


with the approximate distribution 


N 

?( x )=n ’ 

n —1 


(i) 


where each g„(x) is Gaussian with specific parameters. Consider now that we wish to tune the parameters of a 
particular approximate factor g n (x). Then, we define the cavity distribution g' n ( x ) as 


N 

= n -vw 

n'^n 


g(x) _ 

<7n(x) ‘ 


( 2 ) 


Instead of matching the moments of q n (x) and q n (x), EP matches the moments (minimizes the KL divergence) 
of g„(x)g\ n (x) and q n (x)q\ n (x) = g(x). This causes the approximation quality to be higher in places where the 
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entire distribution q(x) is high, at the expense of approximation quality in less relevant regions where q(x) is close 
to zero. To compute the moments of q„ (x)q\”(x) we use Eqs. (5.12) and (5.13) in Minka (2001), which give the 
first and second moments of a distribution p{pc)Af{x | m, V) in terms of the derivatives of its log partition function. 
Thus, given some initial value for the parameters of all the q„(x), the steps performed by EP are 

1. Choose an n. 

2. Compute the cavity distribution q'"(x) given by Eq. (2) using the formula for dividing Gaussians. 

3. Compute the first and second moments of q„ (x)q'"(x) using Eqs. (5.12) and (5.13) in Minka (2001). This 
yields an updated Gaussian approximation q(x) to q(x) with mean and variance given by these moments. 

4. Update q„(x) as the ratio between q(x) and q' n (x), using the formula for dividing Gaussians. 

5. Repeat steps 1 to 4 until convergence. 

The EP updates for all the q„(x) maybe be done in parallel by performing steps 1 to4 for n = 1,... ,N using the 
same q(x) for each n. After his, the new q(x) is computed, according to Eq. (1), as the product of all the newly 
updated q rl (x). For these latter computations, one uses the formula for multiplying Gaussians. 


2 The Gaussian approximation to the NFCPD 


In this section we fill in the details of computing a Gaussian approximation to q{ f, ci,..., c k), given by Eq. (12) 
of the main text. Recall that f = (/(x*), /(xi),..., /(x„)) and C& = (c(x*), c(xi),..., c(x„)), where k = 
1 ,,K. and the first element in these vectors is accessed with the index number 0 and the last one with the index 
number n. The expression for </(f, ci,..., c k) can be written as 


1 I" K 

q( f,c u ...,c K ) = — N (f\ mj red , Vp red ) n^{ 


lk=l 


K 


n 0 [ c m 


.k =1 


N 

n 

n= 1 


K 


C k I m pred’ V pred 


K 


n B[c fc ,„] > 0[/„ - fo] + \ 1 - n 


,fc=l 


fc=1 


(3) 


where nip red and Vp red are the mean and covariance matrix of the posterior distribution of f given the data in T)f 
and nip^ and are the mean and covariance matrix of the posterior distribution of c./, given the data in V k . In 
particular, from Rasmussen & Williams (2006) Eqs. 2.22-2.24, we have that 

m pred = K {(K f + vjiy'yf , 

Vp f red = K{ ) *-K{(K/ + ^ I )- 1 [K{] T , 

where K i is an (N + 1) x N matrix with the prior cross-covariances between elements of f and f t,..., /„ and 
K{. * is an (N + 1) x (N + 1) matrix with the prior covariances between elements of f and vj is the standard 
deviation of the additive Gaussian noise in the evaluations of /. Similarly, we have that 

m^ d = K^(K fc + i 

V^ d = + — K*(K fc + ^I) _ 1 [K^] t , 

where K* is an (N + 1) x N matrix with the prior cross-covariances between elements of and Ck,i ,..., Ck, n 
and K* * is an (N + 1) x ( N + 1) matrix containing the prior covariances between the elements of Ck and Vk is 
the standard deviation of the additive Gaussian noise in the evaluations of Ck■ We will refer to the non-Gaussian 
factors in Eq. (3) as 


K 


K 


f 0 , Cl,r, 


j C/c, 


»)= < n e e [/»* - M+\ i - n e ^ 


i 


fc=i 


(4) 
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(this is called x*, f, c 1; ..., c k) in the main text) and 


fffe(c fc ,o) — ©[Cfe,o] > 


( 5 ) 


such that Eq. (3) can be written as 




.., c K ) ocAT(f\ m pred , V pred ) 


r k 


n^i ■^pred ’ v p c i d ) 

.k =1 


" N 

| ^n(/n? /05 Cl,n? 
_n= 1 



' if 

17 9k{Ck, o) 

_fc=l 


(6) 


We then approximate the exact non-Gaussian factors h n (f n , /o, ci ?n ,..., c^ ?n ) and gk{ck,o) with corresponding 
Gaussian approximate factors /i n (/ n , /o, ci ?n ,..., c^ n ) and g/c(Qc,o)> respectively. By the assumed independence 
of the objective and constraints, these approximate factors take the form 


^n(/m /cb C 1 ,715 * • ’ 5 c fe> „) a exp 


^[/n /o]A ft „ [/„ / 0 ] T + [/„ /o]b 



K 


f[ exp 
fe=1 


1 2 
^ a h n c k . 


+ ^>h„ c k. 


(7) 


and 


3fc(c fe ,o) oc exp j-^a Sfc 4 )0 + & gfc c M j 


( 8 ) 


where A h n . bh n , «/, ri , 6 /, ti , a, /(: and b gk are the natural parameters of the respective Gaussian distributions. 
The right-hand side of Eq. (6) is then approximated by 


«( f . ct, ■ • •, c K ) ex AT (f | m£ ed , v£ ed ) 

" iV 

J7 ^7l(/n5 /05 ^1,715 
_7i—1 



rn 


Cfc 

pred’ 




" K 

17 9k(ck, 0 ) 

_fe=l 


(9) 


Because the h n (f n , /o, ci ?n ,..., c*; ?n ) and gk{ck, o) are Gaussian, they can be combined with the Gaussian terms 
in the first line of Eq. (9) and written as 


q( f,ci,...,c^) oc A/*(f | m f , V f ) 


K 


n 


rn 


;V c fc ) 


Lfe=l 


where, by applying the formula for products of Gaussians, we obtain 

i -i 


v f =|(v p f red )- +v* 

V Cfc = 


m f = V f 


(V p f red )~ 


m pred + m 


/ \ —1 ~ 

-1 

/ \ — i 

(V p ‘ d ) +V C * 

, m Cfc = V Cfc 

(Vp^) m pr g d + m Ct 


with the following definitions for V, m f , V Cfc , and m Cfc : 

• V f is an (N + 1) x (N + 1) precision matrix with entries given by 

- K,n = [ A fJi,i for n = 1,..., N, 

~ Vo, n = vlo = [Ah„] 1)2 for n = 1,..., N, 


( 10 ) 


( 11 ) 
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- «S,0 = En=l [ A tJ2,2> 

- and all other entries are zero. 

• m f is an (TV + 1)-dimensional vector with entries given by 

- rhf n = [ b />Ji for n = 1,..., TV, 

- = Eli KJ 2 . 

• v c <= is a (TV + 1) x (TV + 1) diagonal precision matrix with entries given by 

- = a hn for n = 1, ■ • •, TV, 

- ®o,o = a 9k- 

• m Cfe is an (TV + 1)-dimensional vector such that 

- fh^ k = b hn for n = 1,..., TV, 

- = K- 

In the next section we explain how to actually obtain the values of Aj lri , b, a/, n , b kn , a g) . and b gk by running EP. 


3 The EP approximation to h n and (j h . 

In this section we explain how to iteratively refine the approximate Gaussian factors h n and g k with EP. 


3.1 EP update operations for h n 

EP adjusts each h n by minimizing the KL divergence between 

hnifn, fo, Cl,n, ■■■, Cfc,n)<? V " (f, Cl 5 • • • 5 CK ) 


and 


hnifn, fo, Cl, n,---,Cfc,„)^ n (f, Cl, ..., C k) , 

where we define the cavity distribution g' n (f, ci,..., c k) as 


^"(f,Ci,...,C X ) = 


q{i,c 1 ,...,c K ) 

hn (/n^ /0i C 1 ,n? • • • ? C/c ,n) 


( 12 ) 

(13) 


Because g(f, ci,..., c^) and h n (f n , fo, ci >n ,..., Cfc, n ) are both Gaussian, g\ n (f, ci,..., c/<-) is also Gaussian. 
If we marginalize out all variables except those which h n depends on, namely f n , fo, Ci.„, ..., c k , n , then we have 
that q>(f n , fo, ci, n , • • •, c fej „) takes the form 


K 


Q Xn ifn, fo, Cl, n , ■■■, C k ,„) (X Af ([f n fo] I m ^ld° - V {" 0 ’id°) ^ ( Cfe ’ 

where, by applying the formula for dividing Gaussians, we obtain 


Cfc n Cfc n 

n I ^n,old’ ^n,old 


Lfc=l 


\rfnJo _ I [yf 
V n.old — ^ 1 V ' 


/n,/oJ 


— A h 




-1 


- 1 - b»,} 

{«c [<%] _1 - v,} . 


<oU = i I” 
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(14) 

(15) 

(16) 

(17) 

(18) 


4 



where Vy ^ is the 2x2 covariance matrix obtained by taking the entries corresponding to /.„ and f 0 from V f , 
and rriy ^ is the corresponding 2-dimensional mean vector. Similarly, v^ k n is the variance for Ck, n in Q and 11 Y 
is the corresponding mean. 

To minimize the KL divergence between Eqs. (12) and (13), we match the first and second moments of these 
two distributions. The moments of Eq. (12) can be obtained from the derivatives of its normalization constant. 
This normalization constant is given by 


Z — J h n (f n , /o> Cl ,n? • • • ? Cfe,n)^"(/n,/o i Cl,n> • • • 5 Cfc,n) dfm dfo, dc\ ,n> • • • 5 dc feirl 

= (7n $ [«»] 1 *(«»)+1 1 - n $ w] 11 ■> 


(19) 


.fe = l 


fc=l 


where a k n = rn^J^v^ and a n = [1, -1]V^°[1, -1] T . We follow Eqs. (5.12) and 

(5.13) Minka (2001) to update a,h n and bh„ ; however, we use the second partial derivatives with respect to 

Ct. n , 

V n, old ] 

Slog Z 


rather than first partial derivative with respect to for numerical robustness. These derivatives are given by 


dm.. 


ra,old Z$>(a k )y/v, 
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( 21 ) 


where (f> and $ are the standard Gaussian pdf and cdf, respectively. The update equations for the parameters a./, n 
and bh n of the approximate factor h are then 
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We now perform the analogous operations to update A hri and b /lri . We need to compute 

a log Z {nf=l $[ a n]} H a n) 
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where 


S = [-U]V^[-11] 


(23) 


We then compute the optimal mean and variance (those that minimize the KL-divergence) of the product in Eq. (13) 
by using Eqs. (5.12) and (5.13) from Minka (2001): 


[ V L,/Jnew=V^-V^ 


old 


d log Z ( d log Z 


Ulll n, old 


K Ulll n ,old , 


-2 


(9 log Z 

u v n,old 


\rfn Jo 
v n,old ’ 


m 


/n,/ 0 J neW 


— Irl n,old 


V 


UJp Slog z 


n,old 


am n,old 


(24) 


5 



Next we need to divide the Gaussian with parameters given by Eq. (24) by the Gaussian cavity distribution 
<P n {fn, /o) Ci >ra ,..., Ck, n )- Therefore the new parameters A h n and b/^ of the approximate factor h are obtained 
using the formula for the ratio of two Gaussians: 


[A/i„] new 
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(25) 


3.2 EP approximation to gi,- 

We now show how to refine the { 5 ^}- We adjust each gk by minimizing the KL divergence between 

5fc(cfe,o)^ fe (f,Ci,...,c Ar ) (26) 

and 

5fc(cfc,o)^ fc ( f , ci,, c K ), (27) 

where 


q Xk { f,a,...,c K ) = 


q{ f,Ci,...,Cif) 


9k(ck, 0 ) 

Because q( f, ci,..., c k) and gk{cu, 0 ) are both Gaussian, q^ k (f, Ci,..., c k) is also Gaussian. Furthermore, if we 
marginalize out all variables except those which % depends on, namely Ck, 0 , then q k (i. ci,..., c k) takes the 

form q\ k {c k , 0 ) = N ( Ck ,0 | m fc fe 0 id’ ^^oid) ’ where, by applying the formula for the ratio of Gaussians, m c k and 
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,Ckg, 

fc,old 


are given by 


v l%= {K?J 1 -«<?*} 


Cfc n _ 

TO fc,old = 


{ m n[ V n,n\ 1 “ K } 


(28) 


Similarly as in Section 3.1, to minimize the KL divergence between Eqs. (26) and (27), we match the first and 
second moments of these two distributions. The moments of Eq. (26) can be obtained from the derivatives of its 

normalization constant. This normalization constant is given by Z = $(a) where a = rn£' 0 J* d / yj . Then, the 
derivatives are 

cf){a] 


d log Z 
d 2 log Z (j)[a\ 


8[r, 


l k,o\di 


^old^M 


The update rules for a gk and b gk are then 


Slog Z 

dm t% 


a 


-l 


0N 

$[al 


(29) 


• V 


fc,old 


[b> 


Ck- n 

= ^ m k ,old — 


d 2 lo g z\ dlogZ 


dm kMd 



cm] 2 

Once EP has converged we can approximate the NFCPD using Eq. (14) in the main text: 

p( z |X>,x,x*) « J p(z 0 \f)p(z 1 | ci) • • • p(z K | c K )q(f,Ci,.. .,c K ) 


(30) 


0 M X )] j 0 [/( x ) - /o] + |l- ri 0 [c fc (x)]}) 


di dci ■ ■ ■ dcK, 


(31) 
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where z = (/(x), ci(x),..., Cr’(x)), the first element in z is accessed using the index 0 and we have used the 
assumed independence of the objective and constraints to split p(z | f, ci,.., c k) into the product of the factors 

??(/( x ) I f ) andp(ci(x) | Ci),... ,p(c^(x) | c K ). 


4 Performing the integration in Eq. (31) 

Because the constraint factor on the right-hand side of Eq. (31) only depends on fo out of all the integration 
variables, we can rewrite Eq. (31) as 

p (ziD^v 1 ,.. .,r> x , x ,xi m) ) j ({n^]}^° - /°]+j 1 - n 0 [ 2fc ] 

p(z 0 I f) p{z\ I Cl) • • • p(z K | c*r) q( f, d,..., c K ) df x ■ ■ ■ df N dc x ■ ■ ■ dcx] df 0 , (32) 




where Z is a normalization constant. 

We now compute the inner integral in Eq. (32) analytically. This is possible because the marginal posterior 
predictive distributions on z are Gaussian and we also have a Gaussian approximation q( f, ci,..., c k)- We first 
rewrite the inner integral in Eq. (32) by using the definition of q in Eq. (10): 


K 

/ p(z 0 | f )p(zi | ci) • • • p(z K | c K )Af({ | m f , V f ) W(c fc | m Cfc , V Cfc )d/i • • • df N dc x ■ ■ ■ dc K (33) 
^ fe=l 

For each c k , the product of the conditional Gaussian distribution p(cfc(x) | c k ) with the Gaussian TV (c k | m^, V*,) 
is an (N + 2)-dimensional multivariate Gaussian distribution. All variables in c k are then integrated out leaving 
only a univariate Gaussian on c k (x) with mean and variance m' k and v' k respectively. For the variables /(x) and 
f, the product of p(/(x) | f) and TV(f | mo, Vo) yields an (N + 2)-dimensional multivariate Gaussian distribution. 
The variables /i,..., /jv are integrated out leaving only a bivariate Gaussian on the two-dimensional vector f' = 
(/(x), fo) with mean vector rn' f and covariance matrix V).. Thus, the result of the inner integral in Eq. (32) is 


J p(z o | f )p{z\ | ci) • • • p{z K | c K )AT(f | m 0 , V n ) ^ W(c fc | m k ,\ k )df 1 ■■■df N dc 1 --- dc K 

K 

= V (f' I rn'f, V£) TV (c fc (x) I m' k , v' k ) , 


fc=l 


(34) 


fc=l 


where, using Eqs. (3.22) and (3.24) of Rasmussen & Williams (2006) for the means and variances respectively, we 
have the definitions 

Kli = k Li( x ) T [ K i*\ lmf > 

Kla = [ mf ] 0 ’ 

= M X > X ) ^ k flnal( X ) T { [ K *,*] 1 + 

[ V f ] 2>2 = [V f ] 0 ,o > 

[ V f] 1,2 = M x > x i m) ) - k( ial (x) T {[Kf *]- 1 
m 'k = kfl na ,( x ) T [Kt] _1 m c \ 
v' k = A fc (x,x) - k£ nal (x) T { [K* *] 1 + 


[K{,*]^V f [Ki.J-'JkLW, 

+ KJ-'v^k^]- 1 } k^xi" 0 ), 
[K**] 1 V Cfc [K* J _1 } kg nal ( x ), 


and 


• m f and V f are the posterior mean and posterior covariance matrix of f given by q. 


1 



• kg nal (x) is the (TV + 1)-dimensional vector with the prior cross-covariances between /(x) and the elements 
of f given by /(x*), /(xi),..., f{x N ). 

• k{ * is an (TV + 1) x (AT + 1) matrix with the prior covariances between elements of f. 

• kg nal (x*) is the (TV+1(-dimensional vector with the prior cross-covariances between /(x+) and the elements 
off. 

• kg nal (x fc ) is an (TV + 1) vector with the cross-covariances between Cfc(x) and the elements of c/ ;; given by 
Cfc(x*),Cfc(xi), ... ,c fc (x„). 

• K *. * is an (TV + 1) x (TV + 1) covariance matrix with the prior covariances between the elements of c 
We have now computed the inner integral in Eq. (31), leaving us with our next approximation of the NFCPD: 

p(/(x), c i(x),..., c K (x) \V f ,V\...,V K ,x,x+) » 

({n e [ cfe ( x )]| e [/w - /o]+j 1 - n e t cfc ( x )] 

■I n N ( c *( x ) I i ( f ' I m f> v f) d fo • (35) 

V nk—1 ) 

Eq. (35) is the same as Eq. (15) in the main text. Note that the computations performed to obtain m' k , v' k , nig, and 
Vg do not depend on x. In the following section we make a final Gaussian approximation to Eq. (35), which must 
be performed for every x. 




5 Final Gaussian approximation to the NFCPD, for each x 

Because the right-hand side of Eq. (35) is not tractable, we approximate it with a product of Gaussians that have 
the same marginal means and variances. In particular, we approximate it as 

P (z | V, x, x*) « TV (/(x) | ^ FCPD (x), n? FCPD (x)) AT (c fc (x) | ^ FCPD (x), ^ FCPD (x)) , 

k =1 

where ph F c PD (x), fN F cpD( x ) anc * MnfcpdM anc ^ w nfcpd( x ) are the marginal means and marginal variances of 
/(x) and Cfc(x) according to the right-hand-side of Eq. (35). Using Eqs. (5.12) and (5.13) in Minka (2001) we can 
compute these means and variances in terms of the derivatives of the normalization constant Z in Eq. (35), which 
is given by 


K 


I< 


z=i n +\ 1 - n 


,fc=i 


*:=i 


s = [Vf]g-g + [Vf] 2)2 — 2 [Vg] 1)2 

[1, -l]m'g 




Oik = 


m j. 


( 36 ) 


where 



Doing so yields 


where 


z) NFCrD(x) = [V '] M -£(0 + a ) ([V^] M - [V'] 12 ) 2 
M f CPD (x) = K], + ([V'] M - [Vf] 12 ) A 

,r pD (x) = {^- i +a}- 1 

Mf CPD (x) = <,const( x ) {Kfc] _ 1 K ] _1 + &} , 


P 


b 

Pk 

d 2 logZ 
a [m' fc f 


{nf=i $ [ a fc]}^( c 


a 2 io g z _ , 

- 9 —|- 'Ul, 

a Imif 


Oik + Pk 


a | m k + 

P( a n) , y , N 

Pk {otk + /3fe} 


( 37 ) 


5.1 Initialization and convergence of EP 

Initially EP sets the parameters of all the approximate factors to be zero. We use at the convergence criterion that 
the absolute change in all parameters should be below 10 -4 . 


5.2 EP with damping 

To improve convergence we use damping (Minka & Lafferty, 2002). If ft,“ ew is the minimizer of the KL-divergence, 
damping entails using instead li^ amped as the new factor, as defined below: 

^ amped = [/CT + , (38) 

where h n is the factor at the previous iteration. We do the same for (jk ■ The parameter e controls the amount of 
damping, with e = 1 corresponding to no damping. We initialize e to 1 and multiply it by a factor of 0.99 at each 
iteration. Furthermore, the factors h n and are updated in parallel (i.e. without updating q in between) in order 
to speed up convergence (Gerven et al., 2009). During the execution of EP, some covariance matrices may become 
non positive definite due to an excessively large step size (i.e. large e). If this issue is encountered during an EP 
iteration, the damping parameter is reduced by half and the iteration is repeated. The EP algorithm is terminated 
when the absolute change in all the parameters in q is less than 10~ 4 . 
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